The rise of genomics in snake venom research: recent advances and future perspectives

Abstract Snake venoms represent a danger to human health, but also a gold mine of bioactive proteins that can be harnessed for drug discovery purposes. The evolution of snakes and their venom has been studied for decades, particularly via traditional morphological and basic genetic methods alongside venom proteomics. However, while the field of genomics has matured rapidly over the past 2 decades, owing to the development of next-generation sequencing technologies, snake genomics remains in its infancy. Here, we provide an overview of the state of the art in snake genomics and discuss its potential implications for studying venom evolution and toxinology. On the basis of current knowledge, gene duplication and positive selection are key mechanisms in the neofunctionalization of snake venom proteins. This makes snake venoms important evolutionary drivers that explain the remarkable venom diversification and adaptive variation observed in these reptiles. Gene duplication and neofunctionalization have also generated a large number of repeat sequences in snake genomes that pose a significant challenge to DNA sequencing, resulting in the need for substantial computational resources and longer sequencing read length for high-quality genome assembly. Fortunately, owing to constantly improving sequencing technologies and computational tools, we are now able to explore the molecular mechanisms of snake venom evolution in unprecedented detail. Such novel insights have the potential to affect the design and development of antivenoms and possibly other drugs, as well as provide new fundamental knowledge on snake biology and evolution.


Innovationsfonden (9065-00007B)
Not applicable COFUNDfellowsDTU (713683) Not applicable Abstract: Snake venoms represent a danger to human health, but also a goldmine of bioactive proteins that can be harnessed for drug discovery purposes. The evolution of snakes and their venom has been studied for decades, particularly via traditional morphological and basic genetic methods alongside venom proteomics. However, while the field of genomics has matured rapidly over the past two decades due to the development of Next Generation Sequencing (NGS) technologies, snake genomics remains in its infancy. Here, we provide an overview of the state-of-the-art in snake genomics and discuss its potential implications for studying venom evolution and toxinology. Based on current knowledge, gene duplication and positive selection are key mechanisms in the neofunctionalization of snake venom proteins. This makes snake venoms important evolutionary drivers that explain the remarkable venom diversification and adaptative variation observed in these reptiles. Gene duplication and neofunctionalization have also generated a large number of repeat sequences in snake genomes that pose a significant challenge to DNA sequencing, resulting in the need for substantial computational resources and longer sequencing read length for high quality genome assembly. Fortunately, owing to constantly improving sequencing technologies and computational tools, we are now able to explore the molecular mechanisms of snake venom evolution in unprecedented detail. Such novel insights have the potential to impact the design and development of antivenoms and possibly other drugs, as well as provide new fundamental knowledge on snake biology and evolution.
gene islands. They are hard too assemble without PacBIO HiFi or Minion long reads, and this should be mentioned extensively in a paper like this. One. Response: Good point, we agree with the Reviewer that long-read technology should be specifically mentioned here, as they are essential for the assembly of continuous genomes. We have therefore added two sentences in this paragraph that hopefully make the point clearer now, with a reference to Figure 2 where the long-read technology is also mentioned (lines 232-236).
Lines 346-347: A few lines to convince the reader that these differences aren't attributable to differences in the sequencing technologies (and therefore the ability to recover repetitive elements) would be appropriate here. Response: We have included a paragraph exemplifying that the sequencing pproaches are not responsible for the registeres differences and references accordingly (lines 369-373) Line 366-367: Where is the hypothesis coming from? There is no citation and it hasn't been previously mentioned. Why would microsatellites impact venom evolution? Are they enriched near venom genes? Otherwise this pure speculation needs to be removed. Response: We have removed the statement.
Lines 382 to 387: This is presented as if these are surprising instances of convergence. Rather, these are clearly homologous locations of these tandemly arrayed gene islands. Response: We have modified the text clarified the poinr and included a statement abour the higher GC-content and faster recombination of macrocromosomes (lines 406-407 and 414-416).
Line 479-484: This is confusingly written. Which gene family(s) are you talking about? Name them rather than just saying "Toxin genes". The presentation of the idea in figure 3 doesn't help the confusion, but rather worsens it. MP, SP, CTL, etc., are large gene superfamilies, so of course they are all over the genome of vertebrates, and very few are venom. You need to undertake much clearer writing to get your point across and link the verbiage here to the figure, or remove this content about whole genome duplications completely and just stick to the better understood components of venom evolution. If you think there is clear evidence for an importance of these more ancient WGD events in venom gene evolution, then it certainly merits a place in the paper. But if so, this needs much more detail to differentiate the idea of more recent common ancestors of the venom progenitor genes from the broader gene superfamilies (most of which are housekeeping genes) to which these venom genes belong. Response: We have detailed the toxin gene families studied in the P. flavoridis genome and modified the wording to point across figure 3.
Line 486-491: This is a major misrepresentation of these results. The scaffolds presented are tiny, on the order of only kilobases. Based on the more complete chromosome-scale assemblies done in Crotalus, it is clear that SVSP, SVMP, PLA2, and CTLP all occur in single gene clusters in pitvipers. You need to differentiate genomic scaffolds (a technical artifact of assembly quality) from gene clusters (a biological reality of genomic architecture). And then when doing so, it will always be best to default to the findings of those genomes that are done at chromosome scale, rather than in thousands of tiny scaffolds, until better genomes come out for the other species. See work from Sean Carroll's lab scaffolding individual gene familes, or the Crotalus viridis or C. tigris genomes. Response: Thank you for pointing out. We acknowledge the misunderstanding in the terminology. We have modified the text to make it clear that we are referring to gene clusters rather than scaffolds (lines 498-503).

Minor:
Lines 117-120: Unclear why you would mention RADseq, as it is highly unlikely to hit a venom gene. Alternately, you could expand that it could help reveal potential population demographic trends that underly venom variation, as we did here: https://academic.oup.com/biolinnean/article/132/2/297/6042638?login=true Response: We thank the Reviewer for a good point here. We have chosen to still make a mention about the use of RAD-seq technology, to exemplify to the community another RRS technique that could be useful to study venom evolution. We have modified the sentence as suggested by the Reviewer by adding the Holding et al. study as a direct example of the use of RAD-seq for this purpose while acknowledging that it is less suitable for the study of venom genes (lines 119-124).
Line 297: This also implies greater overall gene density in T. sirtalis (genes/Mb) and potentially differences is repetitive element content. The authors should list out these possible causes and then review which are supported by the literature, rather than just mentioning average intron length differences, which amount to a relatively small overall amount of any genome. Tim Response: We acknowledge that this paragraph needed clarification. We have modified the text to include the considerations about the gene density and included references to support the statements (lines 311-319).
Line 322: There is little to no evidence that "environmental conditions" impact venom gene expression. Clarify this statement. What does matter is population of origin, age of snake, and time since last expulsion of venom. Response: We have clarified the point as suggested and included supporting references (lines 343-344).
Line 336: "fluid genome" is a very imprecise term. I think you want to say "high degree of evolvability in structural features of the genome". Response: We thank the Reviewer for the suggestion. We have chosen to remove the term and change it for the suggestion made by the Reviewer (lines 357-358).
Line 468: "chromatin-wise" is unclear wording. Reword. Response: We have modified the text to emphasize that highly expressed toxin genes are usually on regions where chromatin is available and the methylation allows for higher transcription of the genes, when compared with non toxin genes (lines 491-492).
Line 470: "metavenom network" is also not a generally known, well-defined term. Define or reword to be more specific.
Response: We have reworded the sentence and defined "metavenom network" in the text (line 495).
Line 479-484: This is confusingly written. Which specific gene family(s) are you talking about? Name them rather than just saying "Toxin genes". Response: We have modified the text and included a description of the specific toxin families (lines 504-513)

State-of-the art in snake genomics 216
With the rapid development of high-throughput sequencing technology, large-scale genomic projects 217 have generated rich sequence information data of billions of base pairs and have paved the way for a 218 new era in the field of phylogenetics, whereby the evolutionary history of organisms can be 219 reconstructed from genomic data. The supermatrix method is the most well-known approach for 220 analyzing concatenation of multiple gene sequences, and using genomic data sets with improved 221 resolution can potentially mitigate phylogenetic problems previously caused by sampling errors [81]. 222 However, since only 21 (approximately 0.6%) out of the ca. 3,600 existing snake species have 223 undergone WGS so far [9,15,17,18,[20][21][22][23][24][25][26][27][28]30,[82][83][84][85], snake genomics will likely develop significantly 224 in the coming years. A complete list of currently available snake genomes is provided below in Table  225 2. 226 228 GC% refers to the percentage of the Guanine (G) and Cytosine (C) bases in a genome, scaffold N50 is a measure of the assembly quality (see 229 below), DoC is a measure of the depth of coverage (see below), and INSDC ID is the NCBI gene bank accession number of the respective 230 genome. PB stands for PacBio and IL for Illumina. 231

15
Available snake genomes differ notably in their assembly and annotation qualities, which makes 232 evaluation of genome quality an important factor in determining the suitability of a genome for 233 addressing a given set of questions. For instance, while estimation of nucleotide composition and 234 genomic repeat content can be achieved from a relatively fragmented genome assembly, high-quality 235 genome assemblies are required for analyses of multi-gene families and regulatory elements [86]. The 236 reason for this is that the majority of the known venom gene families form tandem-arrayed "gene 237 islands" (significantly enriched in microchromosomes, see e.g. [13]), which generally represent a 238 challenge for performing a continuous assembly. In order to achieve the best quality of assembly of 239 venom genes, the use of long-read technology (e.g. PacBIO HiFi or MinIon) is therefore essential 240 ( Figure 2). Genome assembly quality is assessed using statistics that measure fragmentation of the 241 genome assembly, such as total assembly length, total contig number, contig N50, and scaffold N50. 242 The total length of the assembly represents the total length of all the contigs that are part of the de novo 243 assembled genome. A high total assembly length usually indicates a high-quality genome assembly. 244 The contig N50 expresses the contiguity of the assembled genome. For instance, a contig N50 of 10 245 kilo bases (kb) implies that 50% of the entire genome assembly is contained in contigs that are longer 246 than 10kb. Thus, a high contig N50 value represents a high-quality assembly without too many gaps.  [83]. 253 Another important parameter is the contig L50, which represents the minimum number 254 of contigs required to cover 50% of the total assembly length. N50 and L50 values can be computed 255  (Table  309 2). This is consistent with previous findings that squamate reptiles and birds generally have smaller 310 genomes than mammals (1-3 Gb for squamates vs 1-2 Gb for birds vs 2-6 Gb for mammals) (    due to the high fragmentation of its assembly (which relied on short-reads) [14]. 438

Understanding snake venom evolution through snake genomes
In the future, widespread access to different types of sequencing platforms providing 439 researchers with both short and long reads, complementary tools for genome analysis (Hi-C and 440 CHiCAGO), and higher quality sequence data will likely enable researchers to study snake genomes 441 in greater detail. In turn, this will help elucidate differences and similarities between snake genomes 442 and allow for more fine-grained studies of the structural characteristics of snake venom genes. the gene regulatory network associated with them (recently termed "metavenom network"),, which 500 comprises ~3000 genes that do not code for toxins but actively influence their expression and 501 postgenomic modifications (e.g. protein folding) in the venom gland as housekeeping genes [48]. 502 Interestingly, this network presents highly conserved elements common to even distantly related 503 lineages such as snakes and venomous mammals; on the other hand, snakes (specifically P. flavoviridis 504 and P. mucrosquamatus) also displayed several unique regulatory genes that were likely co-opted 505 together with neofunctionalized toxin genes absent in other lineages [48]. 506 and CRISP gene families, which therefore also displayed a tendency towards accelerated evolution 581 despite being present in far fewer copies [15]. Similarly, a high KA/KS ratio (2.034 ± 0.818) was 582 observed for the 3FTx gene family in the N. naja genome, again pointing towards rapid differentiation 583 and functional diversification for these genes [27]. Conversely, when KA/KS < 1 is indicative of either 584 neutral selection (random substitutions that confer neither evolutionary advantages nor disadvantages) 585 or purifying selection (i.e. removal of mutations that usually tend to be deleterious as they appear in 586 conserved areas of the gene). In the P. flavoviridis genome study, all non-dominant toxin gene families 587 had a KA/KS < 1 (Mean ± SE = 0.512 ± 0.018), indicating a more neutral nucleotide substitution and 588 the maintenance of similarity between gene copies [15]. On the other hand, when examining sequence 589 divergence using venom gland transcriptomes in sidewinder rattlesnakes (Crotalus cerastes), 590 examining sequence divergence using venom gland transcriptomes data showed evidence of stabilizing 591 selection being stabilized, which supports that the maintainance ofing a generalist phenotype is favored 592 [133]. It must, however, be noted that despite various methods available for studying selection (see 593 [134]), relatively few have been applied in for the investigation of ng selection in venom and only in a 594 small number of species [15,27,133,135]. Therefore, additional studies are required before general 595 conclusions can be drawn. 596 New -omics tools and methods are rapidly advancing our knowledge of the mechanisms 597 behind venom evolution [136]. In particular, WGS has introduced advantages to snake venom research, 598 as WGS data can be used to identify structural variants, including inversions ( Fig. 3A-B), insertions 599 ( Fig. 3C), deletions, tandem repeats ( Fig. 3A-C), transposable elements (TEs), and other repeat content 600 [21,137]. An increasing number of studies report venom variation at different levels, such as 601